I converted the Hugging Face Whisper model to ONNX with optimum-cli:
optimum-cli export onnx --model openai/whisper-small.en whispersmallen
I got 4 ONNX files:
decoder_model_merged.onnx
decoder_model.onnx
decoder_with_past_model.onnx
encoder_model.onnx
Now I want to write code that loads Whisper (as ONNX) and runs it on a 1.wav file.
- How do I do it?
- When using the HF Whisper model, I just run one model (not 2 separate models: encoder/decoder).
1. Install Required Libraries
pip install onnxruntime librosa transformers numpy
2. Preprocess Audio into Log-Mel Spectrogram
import numpy as np
import librosa
from transformers import WhisperFeatureExtractor
# Load audio
audio, sr = librosa.load("1.wav", sr=16000)
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small.en", sampling_rate=16000)
# Convert to log-mel spectrogram
inputs = feature_extractor(audio, return_tensors="np", sampling_rate=16000)
input_features = inputs["input_features"] # shape: (1, 80, 3000)
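If you want to confirm the preprocessing before touching the ONNX models, a quick shape check helps (an optional sanity check, not part of the original steps):
# Whisper pads/trims audio to 30 s, so the log-mel tensor should be (1, 80, 3000)
print(input_features.shape, input_features.dtype)  # expected: (1, 80, 3000) float32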
3. Load ONNX Encoder and Run It
import onnxruntime as ort
# Load encoder
encoder_sess = ort.InferenceSession("whispersmallen/encoder_model.onnx")
# Run encoder
encoder_outputs = encoder_sess.run(
output_names=["last_hidden_state"],
input_feed={"input_features": input_features}
)[0]
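If the tensor names differ between optimum versions, you can list what the exported graph actually expects before calling run (plain onnxruntime introspection, not part of the original answer):
# Print the input/output names and shapes of the exported encoder
for inp in encoder_sess.get_inputs():
    print("input:", inp.name, inp.shape)
for out in encoder_sess.get_outputs():
    print("output:", out.name, out.shape)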
4. Autoregressive Decoding Loop
Whisper uses decoder input tokens (decoder_input_ids
) and the encoder_hidden_states
to generate tokens one by one.
from transformers import WhisperTokenizer
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small.en")
decoder_sess = ort.InferenceSession("whispersmallen/decoder_model.onnx")
# Start with <|startoftranscript|>
decoder_input_ids = np.array([[tokenizer.convert_tokens_to_ids("<|startoftranscript|>")]], dtype=np.int64)
generated_ids = []
for _ in range(100):  # max 100 tokens
    outputs = decoder_sess.run(
        output_names=["logits"],
        input_feed={
            "input_ids": decoder_input_ids,
            "encoder_hidden_states": encoder_outputs,
        },
    )
    next_token_logits = outputs[0][:, -1, :]  # shape: (1, vocab_size)
    next_token_id = np.argmax(next_token_logits, axis=-1)[0]
    if next_token_id == tokenizer.eos_token_id:
        break
    generated_ids.append(next_token_id)
    decoder_input_ids = np.append(decoder_input_ids, [[next_token_id]], axis=-1)
5. Decode Output
transcription = tokenizer.decode(generated_ids, skip_special_tokens=True)
print("Transcription:", transcription)
Summary
- You now need to handle the encoder and decoder explicitly.
- The raw ONNX export does not wrap both in one model (a higher-level wrapper is sketched after this list).
- The decoder loop is autoregressive: it feeds its output token back as input in the next step.
- Pre/postprocessing can still use Hugging Face.
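If you would rather keep the single-call feel of the original Hugging Face model, optimum ships an onnxruntime wrapper that loads the exported folder and exposes generate() for you. A minimal sketch, assuming optimum[onnxruntime] is installed (the manual loop above is roughly what this hides internally):
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import WhisperProcessor
import librosa

# One object wrapping the exported encoder/decoder ONNX files
model = ORTModelForSpeechSeq2Seq.from_pretrained("whispersmallen")
processor = WhisperProcessor.from_pretrained("openai/whisper-small.en")

audio, _ = librosa.load("1.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
predicted_ids = model.generate(inputs["input_features"])
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])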
ChatGPT provided this, and I tested it — it works.